02 - Defining Statistics and Data

I opened up three statistics textbooks and found that they all jumped right into the mathematical fray. But I think it’s useful to take a step back and understand what statistics actually is. The University of California, Irvine (UCI) Department of Statistics defines it as:

Statistics is the science concerned with developing and studying methods for collecting, analyzing, interpreting and presenting empirical data

Good start, but what exactly is data? Unfortunately, the Oxford dictionary defines data as ‘facts and statistics collected together for reference or analysis,’ which is circular: it leans on the very word we are trying to define. Another definition is that ‘data are facts and statistics used to analyze something or make decisions,’ which has the same problem. Spend five minutes looking up definitions of data and you’ll likely end up more confused than when you started.

I think the definition of statistics from UCI is a good starting point - it’s a set of structured methods to understand the data that we have available. Data is information - in a perfect world all our information would be perfectly collected and structured to immediately answer all of society’s pressing questions. However, part of statistics is ferreting out bias and erroneous information, and then understanding the confidence you have in the insights you’ve gleaned from the data.

The simplest explanation I could come up with:

Statistics is the structured method of working with information to take away actionable insights while understanding the constraints and limitations of the information available.

One important note is that statistics has blended together with computer science. A single business might have Six Sigma specialists, data analysts, data scientists, data engineers, and machine learning engineers, all of whom incorporate statistics into their work. Before the computer age, statisticians had to do calculations by hand, which limited their scope and reach (although elemental statistics developed over a century ago has largely stood the test of time). Today, machine learning problems commonly have billions of data points, requiring efficient algorithms and operators well versed in both statistics and computer science. You cannot do meaningful applied statistics without computers.

As a data and geospatial engineer with one foot in business / entrepreneurship and the other in academia, I have found that the statistics covered in a high-quality college-level textbook is incredibly important and immediately applicable to many of the projects you will meet throughout your career. It’s just my opinion, but beyond that level, more advanced statistics is applicable in certain use cases and becomes less relevant to the real world.

Master the basics.